Accessing data through APIs

Learning Goals

After completing this lesson you will be able to

  • define what an API is
  • describe the difference between human-readable and machine-readable data structures
  • understand the relationship between using APIs and reproducible data pipelines
  • know several examples of APIs that provide data from official sources
  • understand what an API request does

Background

Up to this point, you have been downloading data from a website and reading it into Python manually. This works, but it is not very efficient and does not explicitly link your data to your analysis.

It is better to automate this process using Python. Automation is particularly useful when (CU Boulder, 2020):

  • You want to download lots of data or particular subsets of data to support an analysis.
  • There are programmatic ways to access and query the data online.

When you automate data access, download, or retrieval, and embed it in your code, you are directly linking your analysis to your data. Further, when you combine this with Jupyter Notebooks, code comments and expressive coding techniques, you document your workflow more thoroughly.

In short, by linking data access and download to your analysis, you are not only reminding your future self of your process, you are also recording where (and how) you got the data in the first place. This also makes your workflow easier for others to reproduce.

Two Key Formats

The data that you access programmatically may be returned in one of two main formats:

  1. Tabular human-readable files: Files that are tabular, including CSV files (Comma Separated Values) and even spreadsheets (Microsoft Excel, etc.). These files are organized into columns and rows and are “flat” in structure rather than hierarchical.
  2. Structured Machine-readable files: Files that can be stored in a text format but are hierarchical and structured in some way that optimizes machine readability. JSON files are an example of structured machine-readable files.
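To make the contrast between the two formats concrete, the sketch below parses the same small set of records once as CSV and once as JSON, using only Python's standard library. The records, field names and values are made up for illustration and do not come from a real dataset.

```python
import csv
import io
import json

# Made-up example records (not real measurements).
csv_text = """site,parameter,value
0007,SO2,1.2
0008,SO2,0.4
"""

json_text = """
{
  "sites": [
    {"site": "0007", "measurements": [{"parameter": "SO2", "value": 1.2}]},
    {"site": "0008", "measurements": [{"parameter": "SO2", "value": 0.4}]}
  ]
}
"""

# Tabular ("flat") data: every row has the same columns.
# Note that the csv module reads every value as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["site"], rows[0]["value"])

# Hierarchical data: values can be nested lists and dictionaries,
# and numbers keep their numeric type.
data = json.loads(json_text)
first_site = data["sites"][0]
print(first_site["site"], first_site["measurements"][0]["value"])
```

In the CSV version each record is one row, while in the JSON version a site can hold any number of nested measurements.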

What is an API?

An API (Application Programming Interface) is a way for computers to talk to each other through a common set of instructions.

APIs are everywhere and are what makes the web work. For example, every time an app on your phone loads data from a server, it uses an API to do so. If you check your bank balance, the banking app will send a request to the banking server and return a hopefully sufficiently large number.

What is an API, GeeksForGeeks

Three Parts of an API Request

When we talk about APIs, the two components you interact with directly are the request and the response. The third part, listed between them below, is the intermediate step in which the remote server processes the request.

  1. Data REQUEST: You try to access a URL in your browser that specifies a particular subset of data.
  2. Data processing: A web server somewhere uses that URL to query a specified dataset.
  3. Data RESPONSE: That web server then sends you back some content.

The response may give you one of two things:

  • Some data
  • An explanation of why your request failed

(CU Boulder, 2020)
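The request and response steps can be sketched in Python as follows. This is a minimal illustration using the standard library's urllib, not the only way to do it; the function name request_data is hypothetical, and real code may need to handle more failure cases (for example timeouts while reading).

```python
import urllib.error
import urllib.request


def request_data(url, timeout=30):
    """Send a request to `url` and return a tuple (ok, content).

    ok is True when the server returned data. When the request
    failed, ok is False and content explains what went wrong.
    """
    try:
        # 1. REQUEST: ask the server for the resource behind the URL.
        # 2. PROCESSING happens on the remote server, out of our sight.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # 3. RESPONSE: read the content the server sent back.
            body = response.read().decode("utf-8")
        return True, body
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 400 or 404).
        return False, f"HTTP error {err.code}: {err.reason}"
    except urllib.error.URLError as err:
        # The request could not be completed (bad address, no network, ...).
        return False, f"Request failed: {err.reason}"
```

Calling request_data(url) with an API URL returns either the raw response text or a short explanation of the failure, which mirrors the two possible outcomes listed above.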

Environmental Data APIs

Because manually downloading data is cumbersome and error-prone, most environmental data providers maintain APIs or other automated portals for downloading data.

Examples of Environmental APIs

One example of this is the U.S. Environmental Protection Agency’s Air Quality System (AQS) API:

AQS contains ambient air sample data collected by state, local, tribal, and federal air pollution control agencies from thousands of monitors around the nation. It also contains meteorological data, descriptive information about each monitoring station (including its geographic location and its operator), and information about the quality of the samples. More about AQS. Note, AQS does not contain real-time air quality data (it can take 6 months or more from the time data is collected until it is in AQS).

There are several other publicly available APIs:

API Documentation

Ideally, the available API services and their conditions of use are well documented. The online documentation for the EPA Air Quality System (AQS) API is one example.

Using the AQS API

API requests are sent over the internet by constructing an API call that contains a number of predefined parameters.

The example below shows an API call that returns the SO2 monitors at a specific site in Hawaii County, HI.

Example; returns list of SO2 monitors at the Hawaii Volcanoes NP site (#0007) in Hawaii County, HI that were operating on May 01, 2015.
(Note, all monitors that operated between the bdate and edate will be returned):
https://aqs.epa.gov/data/api/monitors/bySite?email=test@aqs.api&key=test&param=42401&bdate=20150501&edate=20150502&state=15&county=001&site=0007

The specified parameters for this call are:

email: email of the user
key: api key to identify the user
param: a number code for the desired parameter (here SO2)
bdate: begin date
edate: end date
state: a number code for the state 
county: a number code for the county 
site: a number code for the site 

See what happens when you copy this API call into a web browser!
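The same API call can also be assembled in Python from its individual parameters, which makes it easy to change a date or a site code later. The sketch below uses urlencode from the standard library and the parameter values of the example call above; the variable names BASE_URL, params and api_call are chosen here for illustration.

```python
from urllib.parse import urlencode

# Address of the AQS "monitors by site" service, taken from the example call.
BASE_URL = "https://aqs.epa.gov/data/api/monitors/bySite"

# The same parameters as in the example call. Codes are kept as strings
# so that leading zeros (e.g. "001") are not lost.
params = {
    "email": "test@aqs.api",
    "key": "test",
    "param": "42401",      # number code for SO2
    "bdate": "20150501",   # begin date
    "edate": "20150502",   # end date
    "state": "15",
    "county": "001",
    "site": "0007",
}

# urlencode joins the parameters as key=value pairs separated by "&".
# safe="@" leaves the "@" in the email address unescaped, as in the example.
api_call = f"{BASE_URL}?{urlencode(params, safe='@')}"
print(api_call)
```

Keeping the parameters in a dictionary means a different query only requires changing one value rather than editing a long URL by hand.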

API Keys

Access to APIs can be restricted through API keys that authenticate individual users. While APIs from official government sources often do not require authentication (or require it only after a certain number of requests), many private services (e.g. Amazon Web Services) charge for API access.

A somewhat recent case was when Twitter became X and started charging for the API that allows Tweets to be retrieved programmatically:

Pricing is tiered with a limited Free tier, a Basic tier (approx. $200/month), a $5,000/month Pro tier, and high-volume Enterprise access starting around $42,000/month. (X.com - API pricing)

This means that publicly sharing API keys can become VERY costly for some services.

Warning

Don’t upload API Keys to publicly accessible GitHub repositories.

Abuse of API keys may lead to being banned from the service or, in the case of commercial APIs that charge for calls, to costly surprises.

Having API-Keys available in public (e.g. github) can have really bad consequences; via reddit.com
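One common way to keep keys out of a repository is to read them from environment variables instead of writing them into the code. The sketch below shows this approach; the function name load_credentials and the variable names AQS_EMAIL and AQS_KEY are hypothetical and can be replaced with any names you prefer.

```python
import os


def load_credentials(email_var="AQS_EMAIL", key_var="AQS_KEY"):
    """Read API credentials from environment variables.

    The variable names are hypothetical defaults, not names
    prescribed by any API provider.
    """
    email = os.environ.get(email_var)
    key = os.environ.get(key_var)
    if not email or not key:
        # Fail early with a clear message instead of sending a bad request.
        raise RuntimeError(
            f"Set the environment variables {email_var} and {key_var} "
            "before running this code."
        )
    return {"email": email, "key": key}
```

Because the key is stored only in your local environment, the code itself can be shared publicly without exposing it.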

Acknowledgements

This lecture is partially based on: